This notebook explores ORB descriptors computed on a set of pictures, looking for insights into their similarities and differences.
Among the 200 000 photos in the Yelp dataset, we manually selected 200 for each category.
Each picture is labelled with its corresponding category:
| | name | category | label |
|---|---|---|---|
| 0 | _YHmgYQovnzexi9GEawZgA.jpg | outside | 3 |
| 1 | _bLipGo4v7eAoprmt3e7hQ.jpg | outside | 3 |
| 2 | _BsXO3m0op5AvstaR1pArg.jpg | outside | 3 |
| 3 | _7ExfZo0unGr9ilqrjbc6A.jpg | outside | 3 |
| 4 | ZPlg8RwKUKqb5CINELxBAA.jpg | outside | 3 |
| ... | ... | ... | ... |
| 795 | TMTti0Az89Tz1a-hPSQFuQ.jpg | beverage | 0 |
| 796 | trxJ56cAxt9WDXC3Z_duNg.jpg | beverage | 0 |
| 797 | TNA-OY4gKUgDPMVTrFacBA.jpg | beverage | 0 |
| 798 | TMmHibQItBz5FwwblyYxgQ.jpg | beverage | 0 |
| 799 | TpjoVp66nCMLR-p43PSKLg.jpg | beverage | 0 |
800 rows × 3 columns
Let's have a look at random pictures:
We'll use the ORB algorithm, which is similar to SIFT and SURF but open-source. Its principle is the same: find keypoints in the image, then describe them in a way that is invariant to blurring, resizing, rotation, etc.
Let's test on a random picture:
If our descriptors were words, there would be only one way to write each of them (after lemmatization). For instance, "dive" is always written that way, and "dove" is a completely different word, even though they differ by a single letter.
But our descriptors are not that precise: if we take two different pictures of the Eiffel Tower and look at the descriptors corresponding to its peak in each photo, they will be similar, but not necessarily identical. This is like getting a picture described by the descriptors 'chaair' and 'chaiir'. So we have to find the "archetypal descriptor" ("chair" in our comparison), i.e. a descriptor that represents its closest neighbors.
This leads us to group our whole set of descriptors into clusters with KMeans.
Note: we could also use these descriptors for image recognition, since we have labelled data; see this article.
Now we're going to create descriptors for each picture in the dataset.
We'll also add keypoints for each picture, for future visualizations.
Once we have this list, the next step will be to find the "archetypal descriptors" by performing a clustering: each centroid will be representative of the descriptors that belong to its cluster.
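The stacking step can be sketched as follows. The per-image arrays are simulated here; in the notebook each one comes from `orb.detectAndCompute` on one picture:

```python
import numpy as np

# Sketch: each image yields an (n_i, 32) array of ORB descriptors (uint8).
# We simulate three images here; in the notebook each array comes from
# orb.detectAndCompute on one picture.
rng = np.random.default_rng(0)
per_image_descriptors = [
    rng.integers(0, 256, size=(n, 32), dtype=np.uint8) for n in (480, 352, 500)
]

# Stack everything into a single (N, 32) matrix ready for clustering
all_descriptors = np.vstack(per_image_descriptors)
print(all_descriptors.shape)  # (1332, 32)
```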
all_descriptors shape is (390183, 32)
| | name | category | label | keypoints | descriptors |
|---|---|---|---|---|---|
| 0 | _YHmgYQovnzexi9GEawZgA.jpg | outside | 3 | [[<KeyPoint 0x7f90a0171840>, <KeyPoint 0x7f90a... | [[60, 205, 58, 254, 29, 199, 87, 240, 196, 4, ... |
| 1 | _bLipGo4v7eAoprmt3e7hQ.jpg | outside | 3 | [[<KeyPoint 0x7f90a1439870>, <KeyPoint 0x7f90a... | [[50, 241, 105, 72, 223, 201, 151, 239, 15, 10... |
| 2 | _BsXO3m0op5AvstaR1pArg.jpg | outside | 3 | [[<KeyPoint 0x7f90a0178630>, <KeyPoint 0x7f90a... | [[107, 12, 145, 156, 253, 240, 192, 134, 223, ... |
| 3 | _7ExfZo0unGr9ilqrjbc6A.jpg | outside | 3 | [[<KeyPoint 0x7f90a1438540>, <KeyPoint 0x7f90a... | [[24, 211, 94, 175, 50, 3, 63, 125, 36, 5, 211... |
| 4 | ZPlg8RwKUKqb5CINELxBAA.jpg | outside | 3 | [[<KeyPoint 0x7f90a01844b0>, <KeyPoint 0x7f90a... | [[58, 141, 2, 154, 221, 209, 202, 231, 134, 4,... |
| ... | ... | ... | ... | ... | ... |
| 795 | TMTti0Az89Tz1a-hPSQFuQ.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3290c0>, <KeyPoint 0x7f906... | [[128, 151, 44, 186, 220, 167, 125, 39, 158, 8... |
| 796 | trxJ56cAxt9WDXC3Z_duNg.jpg | beverage | 0 | [[<KeyPoint 0x7f906c322c00>, <KeyPoint 0x7f906... | [[54, 5, 3, 202, 157, 231, 131, 75, 132, 5, 20... |
| 797 | TNA-OY4gKUgDPMVTrFacBA.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3349f0>, <KeyPoint 0x7f906... | [[6, 253, 191, 254, 93, 215, 255, 248, 191, 19... |
| 798 | TMmHibQItBz5FwwblyYxgQ.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3ee2a0>, <KeyPoint 0x7f906... | [[230, 32, 212, 128, 237, 43, 232, 73, 15, 178... |
| 799 | TpjoVp66nCMLR-p43PSKLg.jpg | beverage | 0 | [[<KeyPoint 0x7f906c2be4b0>, <KeyPoint 0x7f906... | [[121, 115, 104, 103, 48, 26, 90, 56, 33, 100,... |
800 rows × 5 columns
We'll use KMeans for clustering. As a rule of thumb, we'll set the number of clusters k to the square root of the total number of descriptors. In other words, with N descriptors, we end up with a "vocabulary" of $ \sqrt N $ visual words.
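A sketch of this rule of thumb, using a smaller synthetic descriptor matrix for speed. `MiniBatchKMeans` is shown here as a scalable stand-in for plain `KMeans`, which is worth considering with ~400k descriptors:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the (390183, 32) descriptor matrix (much smaller for speed)
rng = np.random.default_rng(0)
all_descriptors = rng.integers(0, 256, size=(10_000, 32)).astype(np.float32)

# Rule of thumb: k = sqrt(N) visual words (625 in the notebook)
k = int(np.sqrt(len(all_descriptors)))

# MiniBatchKMeans is a scalable alternative to KMeans for large N
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0, batch_size=1024).fit(all_descriptors)
print(kmeans.cluster_centers_.shape)
```

Each of the k centroids is itself a 32-dimensional vector, i.e. an "archetypal descriptor".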
Chosen number of clusters: 625
Number of descriptors / number of clusters = 624.2928
Model training... Model trained.
As said above, if we describe an image by its descriptors, similar images may end up being described by slightly different descriptors. As a consequence, they may not be identified as similar (as if we were using the descriptors "cchair" and "chaair" without noticing they are close to each other).
So we will instead describe the image using the "archetypal descriptors" ("chair" in our comparison), i.e. the centroids of the clusters predicted for each descriptor by the KMeans model we just trained.
In doing so, we are likely to reduce the number of distinct descriptors of an image, since several descriptors may belong to the same cluster (i.e. be represented by the same "archetypal descriptor"/centroid).
Once an image is processed, we can sum it up by computing the frequency of each of its archetypal descriptors. This gives us a vector whose size is the number of archetypal descriptors.
If we keep up with our word comparison (that is to say, imagining that our descriptors are strings of letters instead of arrays of numbers), the whole process can be seen as follows:
step 1: compute descriptors for our image. We'll get around 500 descriptors such as 'chaiir', 'chaair', 'chhaire', 'tablle', 'taable', 'glas', 'glass', 'plate'... (as said, these fantasy strings symbolize the arrays of numbers that are our real descriptors)
step 2: clustering the descriptors lets us group them under their 'archetypal' versions, which we use as labels. For instance, the terms 'chaiir', 'chaair' and 'chhaire' are all converted into their centroid, which we'll call 'chair'. So our picture can now be described with: 'chair', 'chair', 'chair', 'table', 'table', 'glass', 'glass', 'plate'.
Conclusion: at first we needed around 500 descriptors to describe our picture, but by clustering the descriptors and computing their frequencies, we end up with a much smaller dimension.
This process is the equivalent of creating a BOW (bag of words), based on a vocabulary made of these "archetypal descriptors". It just implies an extra preparatory step: creating that vocabulary by finding the centroids of the descriptor clusters.
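The histogram step can be sketched like this, with a tiny 2-D toy vocabulary instead of real 32-byte ORB descriptors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny stand-in: a vocabulary of 5 visual words over fake 2-D "descriptors"
rng = np.random.default_rng(0)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(rng.normal(size=(200, 2)))

def bovw_histogram(descriptors, kmeans):
    """Map each descriptor to its nearest centroid (its "archetypal
    descriptor") and count how often each visual word occurs."""
    labels = kmeans.predict(descriptors)
    return np.bincount(labels, minlength=kmeans.n_clusters)

hist = bovw_histogram(rng.normal(size=(40, 2)), kmeans)
print(hist.shape, hist.sum())  # 5 visual-word counts summing to 40
```

The resulting vector is the image's BOVW: one count per visual word, regardless of how many raw descriptors the image had.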
| | name | category | label | keypoints | descriptors | BOVW |
|---|---|---|---|---|---|---|
| 0 | _YHmgYQovnzexi9GEawZgA.jpg | outside | 3 | [[<KeyPoint 0x7f90a0171840>, <KeyPoint 0x7f90a... | [[60, 205, 58, 254, 29, 199, 87, 240, 196, 4, ... | [2, 1, 0, 1, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 1, ... |
| 1 | _bLipGo4v7eAoprmt3e7hQ.jpg | outside | 3 | [[<KeyPoint 0x7f90a1439870>, <KeyPoint 0x7f90a... | [[50, 241, 105, 72, 223, 201, 151, 239, 15, 10... | [0, 0, 0, 1, 1, 0, 0, 2, 2, 0, 0, 0, 1, 0, 1, ... |
| 2 | _BsXO3m0op5AvstaR1pArg.jpg | outside | 3 | [[<KeyPoint 0x7f90a0178630>, <KeyPoint 0x7f90a... | [[107, 12, 145, 156, 253, 240, 192, 134, 223, ... | [0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 2, 0, 0, 1, ... |
| 3 | _7ExfZo0unGr9ilqrjbc6A.jpg | outside | 3 | [[<KeyPoint 0x7f90a1438540>, <KeyPoint 0x7f90a... | [[24, 211, 94, 175, 50, 3, 63, 125, 36, 5, 211... | [2, 2, 0, 2, 0, 0, 2, 1, 2, 1, 0, 2, 1, 2, 1, ... |
| 4 | ZPlg8RwKUKqb5CINELxBAA.jpg | outside | 3 | [[<KeyPoint 0x7f90a01844b0>, <KeyPoint 0x7f90a... | [[58, 141, 2, 154, 221, 209, 202, 231, 134, 4,... | [3, 3, 1, 0, 0, 3, 1, 0, 0, 3, 1, 0, 3, 1, 0, ... |
| ... | ... | ... | ... | ... | ... | ... |
| 795 | TMTti0Az89Tz1a-hPSQFuQ.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3290c0>, <KeyPoint 0x7f906... | [[128, 151, 44, 186, 220, 167, 125, 39, 158, 8... | [1, 1, 0, 1, 2, 3, 1, 2, 0, 1, 2, 2, 1, 2, 1, ... |
| 796 | trxJ56cAxt9WDXC3Z_duNg.jpg | beverage | 0 | [[<KeyPoint 0x7f906c322c00>, <KeyPoint 0x7f906... | [[54, 5, 3, 202, 157, 231, 131, 75, 132, 5, 20... | [0, 5, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, ... |
| 797 | TNA-OY4gKUgDPMVTrFacBA.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3349f0>, <KeyPoint 0x7f906... | [[6, 253, 191, 254, 93, 215, 255, 248, 191, 19... | [1, 2, 3, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, ... |
| 798 | TMmHibQItBz5FwwblyYxgQ.jpg | beverage | 0 | [[<KeyPoint 0x7f906c3ee2a0>, <KeyPoint 0x7f906... | [[230, 32, 212, 128, 237, 43, 232, 73, 15, 178... | [0, 0, 0, 0, 0, 1, 0, 2, 3, 0, 1, 3, 0, 4, 0, ... |
| 799 | TpjoVp66nCMLR-p43PSKLg.jpg | beverage | 0 | [[<KeyPoint 0x7f906c2be4b0>, <KeyPoint 0x7f906... | [[121, 115, 104, 103, 48, 26, 90, 56, 33, 100,... | [1, 3, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 2, 0, 1, ... |
800 rows × 6 columns
We'll use three different visualization methods (Principal Component Analysis, t-SNE and UMAP) to get a 2D projection of these bags of visual words.
Let's normalize the BOVW we just created:
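The notebook doesn't state which normalization it uses; a common choice, sketched here under that assumption, is scikit-learn's `StandardScaler` applied column-wise to the BOVW matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the 800 x 625 BOVW matrix (much smaller here)
rng = np.random.default_rng(0)
bovw = rng.integers(0, 6, size=(8, 10)).astype(float)

# Standardize each visual-word column to zero mean and unit variance
bovw_scaled = StandardScaler().fit_transform(bovw)
print(bovw_scaled.mean(axis=0).round(6))  # ~0 for every column
```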
What proportion of variance do the first 10 components explain?
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.22% | 2.79% | 2.17% | 1.32% | 1.02% | 0.96% | 0.87% | 0.85% | 0.73% | 0.72% |
If variance were spread evenly, each of the 800 components would explain 0.12% of it.
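The three projections can be sketched as below, on a smaller synthetic matrix. UMAP lives in the separate `umap-learn` package, so it is only indicated in a comment:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the scaled 800 x 625 BOVW matrix
rng = np.random.default_rng(0)
bovw_scaled = rng.normal(size=(80, 25))

# PCA: linear projection; also exposes the explained variance ratios
pca = PCA(n_components=10, random_state=0)
pca.fit(bovw_scaled)
coords_pca = pca.transform(bovw_scaled)[:, :2]
print(pca.explained_variance_ratio_.round(3))

# t-SNE: non-linear, neighbour-preserving 2D embedding
coords_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(bovw_scaled)
print(coords_pca.shape, coords_tsne.shape)

# UMAP would follow the same fit_transform pattern with
# umap.UMAP(n_components=2), from the umap-learn package.
```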
In all projections, we observe that the menu images, in red, are rather distinct from the other categories, which remain quite mixed. One hypothesis is that the features of menu images (many horizontal lines of letters) are more specific than those of other images.
We saw that keypoints seem to be particularly well identified for the "menu" category. What do they look like on the pictures?
Let's select the most significant visual words for this category, i.e. the visual words that are:
Then, on a sample of menu pictures, we'll display the corresponding keypoints.
We can see these as "stop labels", the equivalent of stop words in NLP.
Most frequent visual words in all pictures of dataset, sorted by decreasing frequency
| | raw frequency |
|---|---|
| visual word label | |
| 33 | 1299 |
| 85 | 1249 |
| 205 | 1186 |
| 66 | 1157 |
| 70 | 1139 |
| ... | ... |
| 393 | 853 |
| 224 | 850 |
| 354 | 848 |
| 264 | 846 |
| 111 | 844 |
62 rows × 1 columns
We select those that are not in stop_labels.
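A sketch of this selection on a toy BOVW table; the `bovw`, `is_menu` and `stop_labels` names are illustrative stand-ins for the notebook's own variables:

```python
import numpy as np
import pandas as pd

# Toy BOVW table: rows = pictures, columns = visual word labels
rng = np.random.default_rng(0)
bovw = pd.DataFrame(rng.integers(0, 5, size=(20, 30)))
is_menu = np.array([True] * 5 + [False] * 15)  # first 5 pictures are menus

# "Stop labels": the top 10% most frequent visual words over all pictures
global_freq = bovw.sum(axis=0).sort_values(ascending=False)
stop_labels = set(global_freq.head(int(len(global_freq) * 0.10)).index)

# Most frequent visual words in menu pictures, stop labels excluded
menu_freq = bovw[is_menu].sum(axis=0).sort_values(ascending=False)
menu_words = [w for w in menu_freq.index if w not in stop_labels][:10]
print(menu_words)
```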
10 most frequent visual words in "menu" pictures of dataset, sorted by decreasing frequency (after removing top 10% most common global visual words)
| | raw frequency |
|---|---|
| visual word label | |
| 440 | 350 |
| 434 | 310 |
| 480 | 291 |
| 258 | 287 |
| 605 | 284 |
| 599 | 280 |
| 35 | 275 |
| 9 | 273 |
| 55 | 272 |
| 243 | 271 |
If these keypoints are representative of the menu category, we should find few of them (or far fewer) in pictures of other categories. Let's try on a sample of 10 pictures from categories other than "menu":
Now we can answer our previous question: are there more "menu" keypoints in "menu" pictures than in other pictures? If so, this would confirm that these keypoints are discriminant, and explain why the 2D projection shows menu pictures in a fairly concentrated area.
| | avg nb of keypoints |
|---|---|
| in menu pics | 64.8 |
| in other pics | 32.3 |